Abstract

This paper introduces a novel framework for modeling interacting humans in a
multi-stage game environment by combining concepts from game theory and
reinforcement learning. The proposed model has the following desirable
characteristics: (1) players are boundedly rational, (2) players are strategic
(i.e., they account for one another's reward functions), and (3) the model is
computationally feasible even on moderately large real-world systems. To do this,
we extend level-K reasoning to policy space so that, for the first time, it can
handle multiple time steps. This allows us to decompose the problem into a series
of smaller ones to which we can apply standard reinforcement learning algorithms.
We investigate these ideas in a cyber-battle scenario over a smart power grid and
discuss the relationship between the behavior predicted by our model and what one
might expect of real human defenders and attackers.


1 Introduction 

We present a model of interacting human beings that advances the literature by combining concepts
from game theory and computer science in a novel way. In particular, we introduce the first
time-extended level-K game theory model [1,2]. This allows us to use reinforcement learning (RL)
algorithms to learn each player's optimal policy against the level $K-1$ policies of the other players.
However, rather than formulating policies as mappings from belief states to actions, as in partially
observable Markov decision processes (POMDPs), we formulate policies more generally as mappings
from a player's observations and memory to actions. Here, memory refers to all of a player's
past observations.

This model is the first to combine all of the following characteristics. First, players are strategic in
the sense that their policy choices depend on the reward functions of the other players. This is in
contrast to learning-in-games models, in which players do not use their opponents' reward information
to predict their opponents' decisions and to choose their own actions. Second, this approach
is computationally feasible even on real-world problems. This is in contrast to equilibrium models
such as subgame perfect equilibrium and quantal response equilibrium [3]. It is also in contrast to
POMDP models (e.g., I-POMDP), in which players are required to maintain a belief state over spaces
that quickly explode. Third, with this general formulation of the policy mapping, it is straightforward
to introduce experimentally motivated behavioral features such as noisy, sampled, or bounded
memory. Another source of realism is that, with the level-K model instead of an equilibrium model,
we avoid the awkward assumption that players' predictions about each other are always correct.

Figure 1: An example semi Bayes net.

Figure 2: An example iterated semi Bayes net.

We investigate these ideas by modeling a cyber-battle over a smart power grid, and we discuss the
relationship between the behavior predicted by our model and what one might expect of real human
defenders and attackers.


2 Game Representation and Solution Concept 

In this paper, the players will be interacting in an iterated semi net-form game. To explain an iterated
semi net-form game, we will begin by describing a semi Bayes net. A semi Bayes net is a Bayes
net with the conditional distributions of some nodes left unspecified. A pictorial example of a semi
Bayes net is given in Figure 1. Like a standard Bayes net, a semi Bayes net consists of a set of nodes
and directed edges. The oval nodes labeled "S" have specified conditional distributions, with the
directed edges showing the dependencies among the nodes. Unlike a standard Bayes net, there are
also rectangular nodes labeled "U" that have unspecified conditional distributions. In this paper,
the unspecified distributions will be set by the interacting human players. A semi net-form game,
as described in [4], consists of a semi Bayes net plus a reward function mapping the outcome of the
semi Bayes net to rewards for the players.

An iterated semi Bayes net is a semi Bayes net that has been time-extended. It comprises a semi
Bayes net (such as the one in Figure 1) that is replicated $T$ times. Figure 2 shows the semi Bayes
net replicated three times. A set of directed edges $L$ sets the dependencies between two successive
iterations of the semi Bayes net. Each edge in $L$ connects a node in stage $t-1$ with a node in stage
$t$, as shown by the dashed edges in Figure 2. The edge set $L$ is the same between every two
successive stages. An iterated semi net-form game comprises two parts: an iterated semi Bayes
net and a set of reward functions that map the results of each step of the semi Bayes net into an
incremental reward for each player. In Figure 2, the unspecified nodes have been labeled "U_A" and
"U_B" to specify which player sets which nodes.

Having described our model of the strategic scenario in the language of iterated semi net-form
games, we now describe our solution concept. Our solution concept is a combination of the level-K
model, described below, and reinforcement learning (RL). The level-K model is a game theoretic
solution concept used to predict the outcome of human-human interactions. A number of studies
[1,2] have shown promising results predicting experimental data in games using this method. The
solution to the level-K model is defined recursively as follows. A level $K$ player plays as though all
other players are playing at level $K-1$, who, in turn, play as though all other players are playing
at level $K-2$, etc. The process continues until level 0 is reached, where the level 0 player plays
according to a prespecified prior distribution. Notice that running this process for a player at $K \geq 2$
results in ricocheting between players. For example, if player A is a level 2 player, he plays as
though player B is a level 1 player, who in turn plays as though player A is a level 0 player playing
according to the prior distribution. Note that player B in this example may not actually be a level 1
player in reality; player A merely assumes him to be one during his reasoning process.
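The recursion just described can be sketched for a one-shot, two-player matrix game. This is only an illustration of standard level-K reasoning, not the time-extended model of this paper: the function name `level_k_policy`, the payoff matrices, and the level 0 prior below are all hypothetical, and each level above 0 is assumed to play a deterministic best response.

```python
import numpy as np

def level_k_policy(payoffs, k, player, prior):
    """Return the mixed strategy (probability vector) of `player` at level k.

    payoffs[p][a0, a1] is player p's payoff when player 0 plays a0 and
    player 1 plays a1; prior[p] is player p's level 0 distribution.
    """
    if k == 0:
        return np.asarray(prior[player], dtype=float)
    other = 1 - player
    # A level k player assumes the opponent reasons at level k - 1.
    opponent = level_k_policy(payoffs, k - 1, other, prior)
    if player == 0:
        expected = payoffs[0] @ opponent      # one value per own action (rows)
    else:
        expected = opponent @ payoffs[1]      # one value per own action (columns)
    best = np.zeros_like(expected, dtype=float)
    best[int(np.argmax(expected))] = 1.0      # deterministic best response
    return best

# Hypothetical matching game: player 0 (A) wants to match, player 1 (B) to mismatch.
A_payoff = np.array([[1.0, -1.0], [-1.0, 1.0]])
payoffs = [A_payoff, -A_payoff]
prior = [[0.8, 0.2], [0.8, 0.2]]              # assumed level 0 behavior

level2_A = level_k_policy(payoffs, 2, 0, prior)
level1_B = level_k_policy(payoffs, 1, 1, prior)
```

Computing the level 2 policy of player A internally computes the level 1 policy of player B, which in turn uses the level 0 prior of player A; this is the ricocheting between players noted above.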

This work extends the standard level-K model to time-extended strategic scenarios, such as iterated
semi net-form games. In particular, each Undetermined node associated with player $i$ in the iterated
semi net-form game represents an action choice by player $i$ at some time $t$. We model player $i$'s
action choices using the policy function, $\rho_i$, which takes an element of the Cartesian product of
the spaces given by the parent nodes of $i$'s Undetermined node to an action for player $i$. Note that
this definition requires a special type of iterated semi Bayes net in which the spaces of the parents
of each of $i$'s action nodes must be identical. This requirement ensures that the policy function is
always well-defined and acts on the same domain at every step in the iterated semi net-form game.
We calculate policies using reinforcement learning (RL) algorithms. That is, we first define a level
0 policy for each player, $\rho_i^0$. We then use RL to find player $i$'s level 1 policy, $\rho_i^1$, given the level 0
policies of the other players, $\rho_{-i}^0$, and the iterated semi net-form game. We do this for each player $i$
and each level $K$, always training against the level $K-1$ policies of the other players.¹
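The bootstrapping procedure can be sketched as a simple loop over levels and players. This is a minimal sketch under stated assumptions: `bootstrap_policies` and the `train` callback are hypothetical names, and `train` stands in for whatever RL routine returns a player's learned policy when the other players' policies are held fixed.

```python
from typing import Callable, Dict, Hashable

Policy = Callable[..., object]   # maps observations and memory to an action

def bootstrap_policies(
    level0: Dict[Hashable, Policy],
    train: Callable[[Hashable, Dict[Hashable, Policy]], Policy],
    max_level: int,
) -> Dict[int, Dict[Hashable, Policy]]:
    """Build policies[k][i], the level k policy of player i, for k = 0..max_level."""
    policies = {0: dict(level0)}
    for k in range(1, max_level + 1):
        policies[k] = {}
        for i in level0:
            # Opponents are frozen at level k - 1 while player i learns.
            others = {j: p for j, p in policies[k - 1].items() if j != i}
            policies[k][i] = train(i, others)
    return policies
```

Because each call to `train` sees only fixed opponent policies, every training run is a single-agent learning problem, which is what makes standard RL algorithms applicable.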

3 Application: Cybersecurity of a Smart Power Network 

To test our iterated semi net-form game modeling concept, we adopt a model for analyzing
the behavior of intruders into cyber-physical systems. In particular, we consider Supervisory
Control and Data Acquisition (SCADA) systems [5], which are used to monitor and control many
types of critical infrastructure. A SCADA system consists of cyber-communication infrastructure
that transmits data from and sends control commands to physical devices, e.g., circuit breakers in
the electrical grid. SCADA systems are partly automated and partly human-operated. Increasing
connection to other cyber systems creates vulnerabilities to SCADA cyber attackers [6].

Figure 3 shows a single, radial distribution circuit [7] from the transformer at a substation (node
1) serving two load nodes. Node 2 is an aggregate of small consumer loads distributed along the
circuit, and node 3 is a relatively large distributed generator located near the end of the circuit. In
this figure, $V_i$, $p_i$, and $q_i$ are the voltage, real power, and reactive power at node $i$. $P_i$, $Q_i$, $r_i$, and
$x_i$ are the real power, reactive power, resistance, and reactance of circuit segment $i$. Together, these
values represent the following physical system [7], where all terms are normalized by the nominal
system voltage.

$$P_2 = -p_3, \quad Q_2 = -q_3, \quad P_1 = P_2 + p_2, \quad Q_1 = Q_2 + q_2 \tag{1}$$

$$V_2 = V_1 - (r_1 P_1 + x_1 Q_1), \quad V_3 = V_2 - (r_2 P_2 + x_2 Q_2) \tag{2}$$
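As a worked illustration, the circuit model can be evaluated in a few lines. This sketch assumes the reading of Equations (1) and (2) in which segment 2 carries $P_2 = -p_3$, $Q_2 = -q_3$ and segment 1 carries $P_1 = P_2 + p_2$, $Q_1 = Q_2 + q_2$; the function name `circuit_voltages` and all numerical values in the example are hypothetical.

```python
def circuit_voltages(v1, p2, q2, p3, q3, r1, x1, r2, x2):
    """Return (V2, V3) for the three-node circuit, all quantities normalized."""
    # Equation (1): power flows on the two circuit segments.
    P2, Q2 = -p3, -q3
    P1, Q1 = P2 + p2, Q2 + q2
    # Equation (2): voltage drop along each segment.
    v2 = v1 - (r1 * P1 + x1 * Q1)
    v3 = v2 - (r2 * P2 + x2 * Q2)
    return v2, v3

# Hypothetical operating point: with the generator idle, only the node 2 load acts.
v2_idle, v3_idle = circuit_voltages(1.0, 0.5, 0.2, 0.0, 0.0, 0.02, 0.04, 0.02, 0.04)
# Same point with a positive q3 at node 3, which raises both downstream voltages.
v2_inj, v3_inj = circuit_voltages(1.0, 0.5, 0.2, 0.0, 0.1, 0.02, 0.04, 0.02, 0.04)
```

Under this reading, increasing $q_3$ lowers $Q_1$ and therefore raises $V_2$, which is the lever the attacker uses in the scenario described below.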

In this model, $r$, $x$, and $p_3$ are static parameters, $q_2$ and $p_2$ are drawn from a random distribution
at each step of the game, $V_1$ is the decision variable of the defender, $q_3$ is the decision variable of
the attacker, and $V_2$ and $V_3$ are determined by the equations above. The injection of real power
$p_3$ and reactive power $q_3$ can modify $P_i$ and $Q_i$, causing the voltage $V_2$ to deviate from 1.0.
Excessive deviation of $V_2$ or $V_3$ can damage customer equipment or even initiate a cascading failure
beyond the circuit in question. In this example, the SCADA operator's (defender's) control over $q_3$
is compromised by an attacker who seeks to create deviations of $V_2$, causing damage to the system.

In this model, the defender has direct control over $V_1$ via a variable-tap transformer. The hardware
of the transformer limits the defender's actions at time $t$ to the following domain:

$$D_D(t) = \{\min(v_{\max}, V_{1,t-1} + v),\ \max(v_{\min}, V_{1,t-1} - v)\}$$

where $v$ is the voltage step size for the transformer, and $v_{\min}$ and $v_{\max}$ represent the absolute
minimum and maximum voltage the transformer can produce. Similarly, the attacker has taken control of
$q_3$, and its actions are limited by its capacity to produce real power, $p_{3,\max}$, as represented by the
following domain:

$$D_A(t) = \{-p_{3,\max}, \ldots, 0, \ldots, p_{3,\max}\}$$

Figure 3: Schematic drawing of the three-node distribution circuit.

¹ Although this work uses level-K and RL exclusively, we are by no means wedded to this solution concept.
Previous work on semi net-form games used a method known as Level-K Best-of-M/M' instead of RL to
determine actions. That method was not used in this paper because the possible action space is so large.

Via the SCADA system and the attacker's control of node 3, the observation spaces of the two
players include

$$\Omega_D = \{V_1, V_2, V_3, P_1, Q_1, M_D\}, \quad \Omega_A = \{V_2, V_3, p_3, q_3, M_A\}$$

where $M_D$ and $M_A$ each denote two real numbers that represent the respective player's
memory of the past events in the game. Both the defender and the attacker manipulate their controls
to increase their own rewards. The defender desires to maintain a high quality of service by
keeping the voltages $V_2$ and $V_3$ near the desired normalized voltage of one, while the attacker
wishes to damage equipment at node 2 by forcing $V_2$ beyond operating limits, i.e.,

$$R_D = -\left(\frac{V_2 - 1}{\epsilon}\right)^2 - \left(\frac{V_3 - 1}{\epsilon}\right)^2, \quad R_A = \Theta[V_2 - (1 + \epsilon)] + \Theta[(1 - \epsilon) - V_2]$$

Here, $\epsilon \approx 0.05$ for most distribution systems under consideration, and $\Theta$ is the Heaviside step function.
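The two reward functions can be written out directly. This is a minimal sketch assuming the defender's reward is the negative sum of squared voltage deviations scaled by the tolerance, and the attacker's reward is a sum of two step functions that fire when the node 2 voltage leaves the band of width 0.05 around one. The function names are hypothetical, and the value of the step function at exactly zero (taken as 0 here) is an assumption.

```python
def defender_reward(v2, v3, eps=0.05):
    """Quadratic penalty on the deviation of V2 and V3 from 1.0, in units of eps."""
    return -((v2 - 1.0) / eps) ** 2 - ((v3 - 1.0) / eps) ** 2

def heaviside(x):
    """Step function; the value at x == 0 is an assumed convention."""
    return 1.0 if x > 0 else 0.0

def attacker_reward(v2, eps=0.05):
    """One unit of reward whenever V2 is above 1 + eps or below 1 - eps."""
    return heaviside(v2 - (1.0 + eps)) + heaviside((1.0 - eps) - v2)
```

For example, `defender_reward(1.0, 1.0)` is zero, the best the defender can do, while `attacker_reward(0.9)` is one because the voltage is below the lower operating limit.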

Level 0 defender policy The level 0 defender is modeled myopically and seeks to maximize his
reward by following a policy that adjusts $V_1$ to move the average of $V_2$ and $V_3$ closer to one, i.e.,

$$\pi_D(V_2, V_3) = \arg\min_{V_1 \in D_D(t)} \left| \frac{V_2 + V_3}{2} - 1 \right|$$
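A myopic defender of this kind can be sketched as follows. The sketch assumes that, with loads and the attacker's output held fixed for one step, a change in the substation voltage shifts both downstream voltages by the same amount, which follows from the voltage-drop form of Equation (2). The function name `level0_defender` is hypothetical, and the set of candidate voltages is supplied by the caller rather than derived here.

```python
def level0_defender(v1_prev, v2, v3, candidates):
    """Pick the candidate V1 that brings the mean of V2 and V3 closest to one."""
    def miss(v1_new):
        shift = v1_new - v1_prev          # V2 and V3 both move by this amount
        return abs((v2 + v3) / 2.0 + shift - 1.0)
    return min(candidates, key=miss)

# Hypothetical step: downstream voltages are low, so the defender raises V1.
choice_low = level0_defender(1.0, 0.96, 0.96, [0.98, 1.0, 1.02])
# Hypothetical step: voltages are already on target, so V1 is left unchanged.
choice_ok = level0_defender(1.0, 1.0, 1.0, [0.98, 1.0, 1.02])
```

The policy looks only one step ahead and ignores what the attacker might do next, which is what makes it exploitable by the drift-and-strike attacker.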

Level 0 attacker policy The level 0 attacker adopts a drift-and-strike policy based on intimate
knowledge of the system. If $V_2 < 1$, we propose that the attacker would decrease $q_3$ by lowering
it by one step. This would cause $Q_1$ to increase and $V_2$ to fall even farther. This policy achieves
success if the defender raises $V_1$ in order to keep $V_2$ and $V_3$ in the acceptable range. The attacker
continues this strategy, pushing the defender towards $v_{\max}$ until he can quickly raise $q_3$ to push $V_2$
above $1 + \epsilon$. If the defender has neared $v_{\max}$, then a number of time steps will be required for
the defender to bring $V_2$ back in range. More formally, this policy can be expressed as

Level0Attacker()
    $V^* = \max_{q \in D_A(t)} |V_2 - 1|$;
    if $V^* > \theta_A$
        then return $\arg\max_{q \in D_A(t)} |V_2 - 1|$;
    if $V_2 < 1$
        then return $q_{3,t-1} - 1$;
    return $q_{3,t-1} + 1$;

where $\theta_A$ is a threshold parameter.
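The drift-and-strike pseudocode can be turned into a short runnable sketch. Several things here are assumptions rather than part of the original: the attacker predicts the post-move voltage at node 2 by holding the defender's setting and the loads fixed, so that a change of one step in its reactive output moves that voltage by a fixed sensitivity `x1`; the drift step is clamped to the action domain; and the name `level0_attacker` and all numbers in the example are hypothetical.

```python
def level0_attacker(v2, q3_prev, q_domain, x1, theta_a):
    """Return the attacker's next q3: strike if the threshold is met, else drift."""
    def predicted_deviation(q):
        # Raising q3 lowers Q1 by the same amount, so V2 rises by x1 per step.
        return abs(v2 + x1 * (q - q3_prev) - 1.0)

    strike = max(q_domain, key=predicted_deviation)
    if predicted_deviation(strike) > theta_a:
        return strike                      # a strike is predicted to succeed
    # Otherwise drift one step in the direction that worsens the deviation.
    step = -1 if v2 < 1.0 else 1
    # Clamping to the domain is an added safeguard, not in the pseudocode.
    return min(max(q3_prev + step, min(q_domain)), max(q_domain))

domain = range(-5, 6)                      # hypothetical integer action steps
drift_move = level0_attacker(0.99, 0, domain, x1=0.01, theta_a=0.07)
strike_move = level0_attacker(0.99, -5, domain, x1=0.01, theta_a=0.07)
```

In the first call no single move is predicted to exceed the threshold, so the attacker drifts downward; in the second, the attacker has already drifted to its lowest output and a jump to its highest output is predicted to exceed the threshold, so it strikes.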

3.1 Reinforcement Learning Implementation 

Using the level 0 policies defined above as the starting point, we now bootstrap up to higher levels by training
each level $K$ policy against an opponent playing the level $K-1$ policy. To find policies that maximize
reward, we can apply any algorithm from the reinforcement learning literature. In this paper, we use
an $\epsilon$-greedy policy parameterization (with $\epsilon = 0.1$) and SARSA on-policy learning [8]. Training
updates are performed epoch-wise to improve stability. Since the players' input spaces contain
continuous variables, we use a neural network to approximate the Q-function [9]. We improve
performance by scheduling the exploration parameter $\epsilon$ in three segments during training: an $\epsilon$ of near
unity, followed by a linearly decreasing segment, then finally the desired $\epsilon$.
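The three-segment exploration schedule can be sketched as a piecewise function of the training step. The segment lengths `hold_steps` and `decay_steps` and the starting value of exactly 1.0 are hypothetical choices for illustration; the text only specifies a value near unity, a linear decrease, and a final value of 0.1.

```python
def epsilon_schedule(step, hold_steps, decay_steps, eps_start=1.0, eps_final=0.1):
    """Exploration rate: constant, then linearly decreasing, then constant."""
    if step < hold_steps:
        return eps_start                   # segment 1: almost pure exploration
    if step < hold_steps + decay_steps:
        frac = (step - hold_steps) / decay_steps
        return eps_start + frac * (eps_final - eps_start)   # segment 2: linear ramp
    return eps_final                       # segment 3: the desired final rate
```

With `hold_steps=100` and `decay_steps=200`, the rate stays at 1.0 for the first 100 steps, falls linearly over the next 200, and remains at 0.1 afterwards.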

3.2 Results and Discussion 

We present results on the defender's and attacker's behavior at various levels $K$. We note that our
scenario always had an attacker present, so the defender is trained to combat the attacker and has no
training concerning how to detect an attack or how to behave if no attacker is present. Notionally,
this is also true for the attacker's training. However, in real life the attacker will likely know that
there is someone trying to thwart the attack.

Level 0 defender vs. level 0 attacker The level 0 defender (see Figure 4(a)) tries to keep both
$V_2$ and $V_3$ close to 1.0 to maximize his immediate reward. Because the defender makes steps in $V_1$
of 0.02, he does nothing for $30 < t < 60$ because any such move would not increase his reward.
For $30 < t < 60$, the $p_2$, $q_2$ noise causes $V_2$ to fluctuate, and the attacker seems to randomly drift
back and forth in response. At $t = 60$, the noise plus the attacker and defender actions break
this "symmetry", and the attacker increases his $q_3$ output, causing $V_2$ and $V_3$ to rise. The defender
responds by decreasing $V_1$, indicated by the abrupt drops in $V_2$ and $V_3$ that break up the relatively
smooth upward ramp. Near $t = 75$, the accumulated drift of the level 0 attacker plus the response
of the level 0 defender pushes the system to the edge. The attacker sees that a strike would be
successful (i.e., post-strike $V_2 < 1 - \theta_A$), and the level 0 defender policy fails badly. The resulting
$V_2$ and $V_3$ are quite low, and the defender ramps $V_1$ back up to compensate. Post strike ($t > 75$),
the attacker's threshold criterion tells him that an immediate second strike would not be
successful. This shortcoming is resolved by level 1 reinforcement learning. Overall,
this is the behavior we have built into the level 0 players.

Level 1 defender vs. level 0 attacker During the level 1 training, the defender likely experiences
the type of attack shown in Figure 4(a) and learns that keeping $V_1$ a step or two above 1.0 is a good
way to keep the attacker from putting the system into a vulnerable state. As seen in Figure 4(b), the
defender is never tricked into performing a sustained drift because the defender is willing to take
a reduction to his reward by letting $V_3$ stay up near 1.05. For the most part, the level 1 defender's
reinforcement learning effectively counters the level 0 attacker's drift-and-strike policy.

Level 0 defender vs. level 1 attacker The level 1 attacker learning sessions correct a shortcoming
in the level 0 attacker. After a strike ($V_2 < 0.95$ in Figure 4(a)), the level 0 attacker drifts up from
his largest negative $q_3$ output. In Figure 4(c), the level 1 attacker anticipates that the increase in $V_2$
when he moves from $q_3 = -5$ to $q_3 = +5$ will cause the level 0 defender to drop $V_1$ on the next move.
After this drop, the level 1 attacker also drops from $q_3 = +5$ to $-5$. In essence, the level 1 attacker
is leveraging the anticipated moves of the level 0 defender to create oscillatory strikes that push $V_2$
below $1 - \epsilon$ nearly every cycle.